Multivariate Adaptive Regression Splines

  • Splines

  • Regression

  • Adaptive

  • Multivariate

Splines

What are Splines?

  • They are flexible lines/curves that are combined together

  • Individual spline components are generally relevant to only part of the space/range of a feature or a set of features

  • They have continuity constraints (joined together, no jumps) and smoothness constraints (e.g. continuity up to the second derivative at a join)

Splines are flexible

Interpolation Example
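A minimal interpolation sketch (not the exact example from the slides) using base R's `stats::splinefun()`, which returns a cubic spline passing exactly through the supplied points:

```r
# Cubic spline interpolation: smooth between knots, exact at the knots.
x <- c(0, 1, 2, 3, 4)
y <- c(0, 1, 0, 1, 0)
f <- splinefun(x, y)           # piecewise cubic with continuous derivatives

f(1.5)                         # evaluate between the knots
curve(f(x), from = 0, to = 4)  # the flexible curve
points(x, y, col = 'blue')     # the interpolated points
```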

Regression

Wages dataset from Intro to Stat Learning

library(tidyverse)
library(ISLR)
?Wage  # for more details on the dataset

# Exclude very high salaries
Wage <- Wage %>% filter(wage < 250)
p <- ggplot(Wage, aes(x = age, y = wage)) + geom_point(colour = 'blue')
p

Wage v. Age

Local means

Local linear models

Linear Spline

Cubic Spline

Twitter’s View

Adaptive

Toy Data

20 Knots / 21 Segments, piecewise linear

5 Knots / 6 Segments, piecewise linear

2 Knots / 3 Segments, piecewise linear

Adaptive

model complexity depends on the data

  • MARS tries to use as few basis functions as ‘necessary’
  • It is adaptive to the data, using only the features and flexibility required
  • It uses a build phase (adding basis functions) and a pruning phase (removing basis functions)
  • This applies both to the features used and to the knots (breakpoints) within a feature
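In the earth package the two phases surface directly as arguments; a sketch on the built-in `trees` data (`nk` caps the terms added in the build phase, and `pmethod = "backward"` is the default pruning method):

```r
library(earth)

# Build phase capped at nk terms; prune phase via backward elimination on GCV.
fit <- earth(Volume ~ Girth + Height, data = trees,
             nk = 21, pmethod = "backward")
summary(fit)  # reports the terms kept after pruning
```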

Multivariate

Multivariate

  • Splines can be fit with the same machinery as multivariate linear models, which use a ‘design’ or ‘model’ matrix

  • Each column of your model matrix corresponds to a basis function (more on this later)

  • Interactions are modelled using tensor products. This is equivalent to including a column with the element-wise multiplication of other columns (the same as in standard multivariate linear regression)
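The element-wise-product view of a tensor-product interaction can be sketched in base R (the toy values and the knot 1.5 are made up for illustration):

```r
h <- function(u, t) pmax(0, u - t)  # hinge basis function

x <- c(1, 2, 3)
y <- c(0.5, 1, 2)

X <- cbind(Intercept = 1,
           x         = x,
           hx        = h(x, 1.5),      # (x - 1.5)+
           xy        = x * y,          # plain interaction column
           hx_y      = h(x, 1.5) * y)  # knotted interaction column
X
```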

Interaction term present (with knotted term)

Piecewise linear fitting challenge: cos(x)*cos(y)

Use 20 knots + interactions, 483 basis functions

Limited to about 100 basis functions

Fitting details

Fitting Splines in the Linear Model framework

  • To fit a linear piecewise model we can use the machinery of linear models

  • We use spline basis functions

  • Run a feature through the basis function. This is the same idea as squaring a feature, taking its log, or truncating it to a maximum value before using it in a linear model

  • Use the resulting features as the columns of the model matrix \(\bf{X}\) in your linear model ‘\(\bf{y = a + Xb + \epsilon}\)’

  • Class attributes can be added to the model matrix with the usual one-hot encoding

  • It is a generalisation of linear modelling; MARS assumes Gaussian errors

MARS basis functions

  • The basis functions used in MARS are of the form

  • \(\bf{(x-t_{x3})_+}\)

  • \(\bf{(t_{x3}-x)_+}\)

and interactions such as:

  • \(\bf{x \cdot y}\)
  • \(\bf{x\cdot(y-t_{x4})_+}\)
  • \(\bf{(x-t_{x3})_+\cdot(y-t_{y2})_+}\)

where \(\bf{x}\), \(\bf{y}\) are features and \(\bf{t_{x3}}\) etc. are knots (actual values in the range of the feature).
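These hinge pairs are simple to compute; a base-R sketch with an arbitrary knot t = 1:

```r
h <- function(u, t) pmax(0, u - t)  # (u - t)+ : zero left of the knot

x <- seq(-2, 2, by = 1)
cbind(x,
      right = h(x, 1),  # (x - t)+ : active to the right of the knot
      left  = h(1, x))  # (t - x)+ : active to the left of the knot
```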

Single basis function: \((x-t_{1})_+\)

MARS pairs: \((x-t_{1})_+\); \((t_{1}-x)_+\)

\((x-t_{1})_+\); \((t_{1}-x)_+\); \((x-t_{2})_+\); \((t_{2}-x)_+\)

Model Matrix

Intercept      x   (x - -3.3)+   (-3.3 - x)+   (x - 3.3)+   (3.3 - x)+
        1  -10.0           0.0           6.7          0.0         13.3
        1   -8.0           0.0           4.7          0.0         11.3
        1   -6.0           0.0           2.7          0.0          9.3
        1   -3.9           0.0           0.6          0.0          7.2
        1   -3.3           0.0           0.0          0.0          6.6
        1   -1.9           1.4           0.0          0.0          5.2
        1    0.1           3.4           0.0          0.0          3.2
        1    2.1           5.4           0.0          0.0          1.2
        1    3.3           6.6           0.0          0.0          0.0
        1    4.1           7.4           0.0          0.8          0.0
        1    6.2           9.5           0.0          2.8          0.0
        1    8.2          11.5           0.0          4.8          0.0
        1   10.0          13.3           0.0          6.7          0.0
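The model matrix above can be reproduced in base R (the x values and the knots at -3.3 and 3.3 are read off the table):

```r
h <- function(u, t) pmax(0, u - t)  # hinge basis function

x <- c(-10, -8, -6, -3.9, -3.3, -1.9, 0.1, 2.1, 3.3, 4.1, 6.2, 8.2, 10)

X <- cbind(h(x, -3.3), h(-3.3, x), h(x, 3.3), h(3.3, x))
colnames(X) <- c("(x - -3.3)+", "(-3.3 - x)+", "(x - 3.3)+", "(3.3 - x)+")
round(X, 1)

# Given a response y, the spline fit is then an ordinary linear model
# (lm() supplies the intercept and x columns itself):
# fit <- lm(y ~ x + X)
```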

Adaptive fitting, Building phase

  • Start with a base model (intercept)

  • Scan through all candidate basis functions (all features and all knots)

  • At each stage we consider the product of basis functions already in the model with candidate basis functions

  • One restriction is that a feature can appear only once in a term (interactions are OK; higher-order powers are not)

  • Usually restrictions are put on the degree of interactions

  • Add the basis function pair which most reduces the loss (RSS)

  • Continue until a stopping criterion is TRUE (e.g. max number of basis functions, reduction in RSS is very small, numerical stability issues)
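The greedy scan can be sketched for a single feature in base R (illustrative only; the real earth search is far more efficient and considers all features):

```r
set.seed(1)
x <- sort(runif(100, -3, 3))
y <- sin(x) + rnorm(100, sd = 0.1)

h <- function(u, t) pmax(0, u - t)  # hinge basis function

X <- matrix(1, nrow = length(x))    # base model: intercept only
for (step in 1:3) {
  best <- list(rss = Inf)
  for (t in x) {                    # candidate knots = observed values
    Xc  <- cbind(X, h(x, t), h(t, x))         # try adding the mirrored pair
    rss <- sum(lm.fit(Xc, y)$residuals^2)
    if (rss < best$rss) best <- list(rss = rss, X = Xc)
  }
  X <- best$X                       # keep the pair that most reduced RSS
}
ncol(X)  # 1 intercept + 3 hinge pairs = 7 columns
```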

Adaptive fitting, Pruning phase

  • Remove basis functions one by one, each time choosing the one whose removal least increases the Residual Error (the most useless)

  • The ‘best’ model among all the pruning-phase models is chosen (by default) using GCV, Generalised Cross Validation, which is similar to AIC or BIC: it prefers lower RSS but penalises more complex models

  • GCV stands for Generalised Cross Validation, a perhaps misleading term because no cross-validation is actually performed

  • One can also use a test set to pick the best model, or real cross-validation, depending on the implementation
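A sketch of the GCV criterion (the effective-parameter count and penalty d = 3 follow Friedman's suggestion for models with interactions; the exact accounting differs by implementation):

```r
# GCV: in-sample RSS inflated by an effective-parameter penalty.
gcv <- function(rss, n, n_terms, n_knots, d = 3) {
  enp <- n_terms + d * n_knots   # effective number of parameters
  (rss / n) / (1 - enp / n)^2
}

# Same RSS but more terms/knots => worse (larger) GCV:
gcv(rss = 50, n = 100, n_terms = 5, n_knots = 2)
gcv(rss = 50, n = 100, n_terms = 9, n_knots = 4)
```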

Packages that implement MARS

Important point

  • The term “MARS” is trademarked and licensed exclusively to Salford Systems: https://www.salford-systems.com. We can use MARS as an abbreviation; however, it cannot be used for competing software solutions.

  • The only official MARS is from Salford Systems

PolyMARS (Elements of Statistical Learning)

PolyMARS

  • This was published as a package in S-PLUS before the trademark restrictions came in

  • Coded in C and S

  • A slightly different algorithm from the original MARS

  • Allowed for multiple responses (polychotomous), hence classification

  • Although open-source, Salford Systems licensed the commercial use of the code for 10 years

  • The package was ported to R; eventually the functionality was incorporated into a more general spline package called polspline

PolyMARS in the wild

PolyMARS in the wild

MDA

  • mda is a package that originally implemented MARS in R; authors Hastie & Tibshirani

  • The package is focused on Mixture Discriminant Analysis but has full MARS functionality

Earth

  • Earth package (in R and Python) is probably the best version to use

Earth

  • Implements MARS

  • Does Polychotomous Regression (so Classification)

  • Implements many Loss functions using the machinery of GLMs

    • Binomial
    • Poisson
    • Gamma
    • Tweedie
    • inverse.gaussian

This means successes, counts, insurance losses, randomly large positive amounts, etc. can be modelled
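In earth these losses are selected through the `glm` argument; a sketch using the `etitanic` dataset shipped with the package:

```r
library(earth)
data(etitanic)  # Titanic survival data bundled with earth

# Binomial loss: the selected basis functions are refit as a logistic GLM.
fit <- earth(survived ~ ., data = etitanic,
             degree = 2, glm = list(family = binomial))
summary(fit)
```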

A few example datasets

Boston Housing Data

Boston Housing Data

library('tidyverse')
library('earth')
library('caret')

input_dir <- '/Users/Martin/Google Drive/Projects/MARS'
data_file <- "housing.data"

housing <- read_table(file.path(input_dir, data_file), col_names = FALSE) 
colnames(housing) <- c('CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV')

set.seed(92835)
training_idx <- createDataPartition(housing$MEDV, p=0.7, list=FALSE)
train_data <- housing %>% slice(training_idx)

emars.fit <- earth(MEDV ~ ., degree = 2, trace = 3, data = train_data)
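The held-out predictions behind the plots come from `predict()`; a self-contained sketch using the MASS copy of the Boston data (lower-case column names, no filtering, and a base-R partition instead of the housing.data file and caret split above):

```r
library(earth)
library(MASS)  # provides the Boston data frame

set.seed(92835)
idx   <- sample(nrow(Boston), round(0.7 * nrow(Boston)))
train <- Boston[idx, ]
test  <- Boston[-idx, ]

fit  <- earth(medv ~ ., degree = 2, data = train)
pred <- predict(fit, newdata = test)   # test-set predictions
sqrt(mean((test$medv - pred)^2))       # test RMSE
```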

Boston Housing Data: summary()

Boston Housing Data: plotmo()

Boston Housing Data: test predict v. actual

Boston Housing Data: test predict v. residuals

MNIST

MNIST

MNIST

  • Degree 1, multiclass Earth; 10 of 784 predictors, 21 terms: 66%

  • Degree 3, multiclass Earth; 20 of 784 predictors, 41 terms: 74%

  • Degree 4, multiclass Earth; 34 of 784 predictors, 77 terms: 78%

  • Degree 5, multiclass Earth; 14 of 784 predictors, 27 terms: 64%

MNIST Degree 4 MARS Model basis functions

h(5-X156) h(X410-5) h(X184-7) * h(X272-5) h(X436-6) * h(X540-5)
h(X156-5) h(4-X457) h(X184-7) * h(5-X272) h(X457-4) * h(X570-6)
h(6-X183) h(X457-4) h(6-X210) * h(X213-6) h(X457-4) * h(6-X570)
h(X183-6) h(5-X540) h(X210-6) * h(X213-6) h(42-X543) * h(X625-2)
h(7-X184) h(X540-5) h(X211-6) * h(X491-7) h(42-X543) * h(2-X625)
h(X184-7) h(42-X543) h(X211-6) * h(7-X491) h(X543-42) * h(X657-53)
h(6-X211) h(X543-42) h(4-X291) * h(X379-45) h(X543-42) * h(53-X657)
h(X211-6) h(12-X156) * h(X184-7) h(X291-4) * h(X379-45) h(X210-6) * h(X213-6) * h(X599-7)
h(6-X213) h(X156-12) * h(X184-7) h(4-X347) * h(X543-42) h(X210-6) * h(X213-6) * h(7-X599)
h(X213-6) h(X156-5) * h(X490-7) h(X347-4) * h(X543-42) h(2-X243) * h(X543-42) * h(53-X657)
h(6-X239) h(X156-5) * h(7-X490) h(3-X349) * h(43-X374) h(X243-2) * h(X543-42) * h(53-X657)
h(X239-6) h(13-X183) * h(X211-6) h(X349-3) * h(43-X374) h(4-X291) * h(X379-45) * h(X523-3)
h(6-X240) h(X183-13) * h(X211-6) h(45-X379) * h(X437-62) h(4-X291) * h(X379-45) * h(3-X523)
h(X240-6) h(X183-6) * h(X463-7) h(45-X379) * h(62-X437) h(4-X406) * h(42-X543) * h(2-X625)
h(43-X374) h(X183-6) * h(7-X463) h(45-X379) * h(X461-49) h(X406-4) * h(42-X543) * h(2-X625)
h(X374-43) h(X184-7) * h(X240-9) h(45-X379) * h(49-X461) h(110-X182) * h(X210-6) * h(X213-6) * h(X599-7)
h(45-X379) h(X184-7) * h(9-X240) h(X379-45) * h(X490-54) h(X182-110) * h(X210-6) * h(X213-6) * h(X599-7)
h(X379-45) h(X184-7) * h(X271-5) h(X379-45) * h(54-X490) h(X243-2) * h(X493-10) * h(X543-42) * h(53-X657)
h(5-X410) h(X184-7) * h(5-X271) h(6-X436) * h(X540-5) h(X243-2) * h(10-X493) * h(X543-42) * h(53-X657)

MNIST

  • Preprocessed the data with de-skewing

  • Used a 10-model ensemble of one-versus-rest binomial-loss Earth models

  • Balanced data in each single model through down-sampling

  • Maximum degree of interaction permitted: 5

  • Test set accuracy of 95.4% - not bad

  • But don’t use a model like this for image analysis

One more task: Wine

  • https://archive.ics.uci.edu/ml/datasets/Wine

  • These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

  1. Alcohol
  2. Malic acid
  3. Ash
  4. Alcalinity of ash
  5. Magnesium
  6. Total phenols
  7. Flavanoids
  8. Nonflavanoid phenols
  9. Proanthocyanins
  10. Color intensity
  11. Hue
  12. OD280/OD315 of diluted wines
  13. Proline

  • Number of Instances - Class 1 has 59; Class 2 has 71; Class 3 has 48

  • Train/test : 70% / 30%

Wine - Earth - R code

library('earth')

...        

emars_fit <- earth(x = train_feat, y = train_label_one_hot, degree = 2, trace = 4)
e_test_pred <- apply(predict(emars_fit, newdata = test_feat), 1, which.max)
e_confusion_tab <- table(e_test_pred, test_label)

Wine - PolyMARS - R code

library('polspline')

...

pmars.fit <- polymars(resp = train_label, pred = train_feat, verbose = TRUE, classify = TRUE)
p_test_pred <- predict(pmars.fit, x = test_feat, classify = TRUE)
p_confusion_tab <- table(p_test_pred, test_label)

Wine - Test Set Confusion Matrix



EARTH
      1   2   3
  1  17   1   0
  2   2  19   0
  3   0   0  14

PolyMARS
      1   2   3
  1  18   1   0
  2   1  19   0
  3   0   0  14

Conclusion

  • MARS by way of EARTH is a handy package for whipping up a linear/GLM model

  • Anywhere you use a linear model, MARS can quickly provide

    • Feature selection
    • Feature Engineering
    • Interaction selection
    • and a decent model
  • Many extra features we didn’t cover, such as fine-grained control of the model search: terms not allowed to interact, a starting model, etc.

  • Good documentation

Conclusion

  • Works with mixed data-types, continuous and class (one-hot encoding)

  • Simple to set up, not many configuration options

  • Works across a number of GLM types (Tweedie, Gamma, Binomial, …) as well as Gaussian (MSE)

  • Can get you most, if not all, of the way there without much fuss

  • Or it can provide a good initial benchmark model if you are going to invest a lot of time in other forms of models